We congratulate the authors (hereafter BH) for an
interesting take on the boosting technology, and for
developing a modular computational environment in
R for exploring their models. Their use of
low-degree-of-freedom smoothing splines as base
learners provides an interesting approach to adaptive
additive modeling. The notion of "Twin Boosting" is
interesting as well; besides the adaptive lasso, we have
seen the idea applied more directly for the lasso and
Dantzig selector (James, Radchenko and Lv (2007)).

In this discussion we elaborate on the connections 
between L2-boosting of a linear model and infinites- 
imal forward stagewise linear regression. We then 
take the authors to task on their definition of de- 
grees of freedom. 

1. L2-BOOST AND INFINITESIMAL
FORWARD STAGEWISE LINEAR 
REGRESSION 

Motivated by a version of L2-boosting in Chap- 
ter 10 of Hastie, Tibshirani and Friedman (2001), 
Efron, Hastie, Johnstone and Tibshirani (2004) pro- 
posed the LARS algorithm. The intent was to: 

• develop a limiting version of L2-boost in which 
the step-length went to zero; 

• show that this limiting version gave paths identi- 
cal to the lasso, as was hinted in that chapter. 

The result was three very similar varieties of the
LARS algorithm, namely lasso, LAR and infinitesimal
forward stagewise (iFSLR) (package lars for R,
available from CRAN). iFSLR is indeed the limit of
L2-boost as the step-length $\nu \downarrow 0$, with
piecewise-linear coefficient profiles, but is not
always the same as the lasso.

On a slight technical note, the version of L2-boost 
proposed in BH is slightly different from that in 
Hastie, Tibshirani and Friedman (2001). Compare the
two update rules, (1) and (2).
Despite the difference, they both have the same limit,
which is computed exactly for squared-error loss by
the type="forward.stagewise" option in the package
lars. As the step-length $\nu$ gets very small,
initially the same coefficient tends to get
continuously updated by infinitesimal amounts (hence
linearly). Eventually a second variable ties with the
first for coefficient updates, which they share in a
balanced way while remaining tied. Then a third joins
in, and so on. Using simple least-squares
computations, the LARS algorithm computes the entire
iFSLR path with the same cost as a single multiple
least-squares fit. Note that in this limiting case, we
can no longer index the sequence by step-number m as
in (1) or (2), but must resort to some other measure,
such as the L1 arc-length of the coefficient profile
(Hastie, Taylor, Tibshirani and Walther (2007)).
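
The two step rules can be sketched in a few lines of Python. This is a minimal illustration, not the lars implementation. It assumes centered predictors, and it assumes that the BH update moves the selected coefficient by the step-length times its least-squares coefficient on the current residual, while the Hastie, Tibshirani and Friedman (2001) update moves it by a fixed amount in the direction of the sign. The function l2_boost, its step_rule switch and the toy data are hypothetical.

```python
# Sketch only: componentwise L2-boosting for a linear model, pure Python.
# Assumes centered predictors and response; all names are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_boost(columns, y, nu, n_steps, step_rule="ls"):
    """Run n_steps of componentwise L2-boosting and return the coefficients.

    columns   : list of predictor columns, each a list of length n
    step_rule : "ls"   -> step = nu * (least-squares coefficient)      [assumed BH form]
                "sign" -> step = nu * sign(least-squares coefficient)  [assumed HTF form]
    """
    beta = [0.0] * len(columns)
    resid = list(y)
    for _ in range(n_steps):
        # Least-squares coefficient of the residual on each predictor alone.
        ls = [dot(x, resid) / dot(x, x) for x in columns]
        # Select the predictor whose fit reduces the residual sum of squares most.
        j = max(range(len(columns)),
                key=lambda k: ls[k] ** 2 * dot(columns[k], columns[k]))
        if step_rule == "ls":
            delta = nu * ls[j]
        else:
            delta = nu * ((ls[j] > 0) - (ls[j] < 0))
        beta[j] += delta
        resid = [r - delta * xi for r, xi in zip(resid, columns[j])]
    return beta

# Toy data: two orthogonal, centered predictors; y = 2*x1 + 0.5*x2 exactly.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [1.0, -2.0, 0.0, 2.0, -1.0]
y = [2.0 * a + 0.5 * b for a, b in zip(x1, x2)]

early_ls = l2_boost([x1, x2], y, nu=0.1, n_steps=10, step_rule="ls")
early_sign = l2_boost([x1, x2], y, nu=0.01, n_steps=10, step_rule="sign")
late_ls = l2_boost([x1, x2], y, nu=0.1, n_steps=500, step_rule="ls")
late_sign = l2_boost([x1, x2], y, nu=0.01, n_steps=400, step_rule="sign")
```

After ten steps both variants have moved only the first coefficient. Run long enough, both settle near the least-squares solution (2, 0.5), the sign variant to within one step-length.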

Lasso and iFSLR are not always the same. In 
high-dimensional problems with correlated predic- 
tors, lasso profiles become wiggly quickly, whereas 
iFSLR profiles tend to be much smoother and monotone
(Hastie et al., 2007). Efron et al. (2004) establish
sufficient positive cone conditions on the model
matrix X which effectively limit the amount of cor- 
relation between the variables and guarantee that 
lasso and iFSLR are the same; in particular, if the 
lasso profiles are monotone, all three algorithms are 
identical. 

2. DEGREES OF FREEDOM 

The authors propose a simple formula for the degrees
of freedom for an L2-boosted model. They construct
the hat matrix $\mathcal{B}_m$ that computes the fit
at iteration m, and then use
$\mathrm{df}(m) = \mathrm{trace}(\mathcal{B}_m)$.
They are in effect treating the model at stage m
as if it were computed by a predetermined sequence
of linear updates. If this were the case, their
formula would be spot on, by the accepted definitions
for effective degrees of freedom for linear operators
(Hastie et al., 2001; Efron et al., 2004). They
acknowledge that this is an approximation (since the
sequence was not predetermined, but rather adaptively
chosen), but do not elaborate. In fact this
approximation can be very badly off. Figure 1 shows
the true degrees of freedom $\mathrm{df}_T(k)$ plotted
against $\mathrm{df}(k)$ for two examples. We see that
$\mathrm{df}(k)$ always underestimates
$\mathrm{df}_T(k)$. We now discuss the details
of these examples, and the basis for these claims.

[Figure 1. Panels: "Prostate Data" (left) and "Haar Basis Regression" (right). Horizontal axes: true $\mathrm{df}_T(k)$.]

Fig. 1. The effective degrees of freedom for L2-boost computed using the trace formula (vertical axis) vs. the exact degrees of
freedom. The left plot is for the prostate cancer data example; the right plot is for a simulated univariate smoothing problem.
In both cases df(m) underestimates the true degrees of freedom quite dramatically.
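
To make the trace formula concrete, the sketch below builds the hat matrix for componentwise L2-boosting with a finite step-length and takes its trace. It is a toy illustration under assumptions: there is no intercept, the data are centered, and boost_hat_matrix and the data are hypothetical. The product form follows the description above of a sequence of linear updates, with the selected sequence treated as if it had been fixed in advance.

```python
# Sketch only: hat matrix of componentwise L2-boosting, with the selection
# order treated as fixed. No intercept; centered data; names are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def boost_hat_matrix(columns, y, nu, n_steps):
    """Return (B_m, fitted) after n_steps, where
    B_m = I - (I - nu*H_m) ... (I - nu*H_1) and H_s projects onto the
    single predictor selected at step s."""
    n = len(y)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    prod = [row[:] for row in eye]      # running product of (I - nu*H_s)
    resid = list(y)
    for _ in range(n_steps):
        ls = [dot(x, resid) / dot(x, x) for x in columns]
        j = max(range(len(columns)),
                key=lambda k: ls[k] ** 2 * dot(columns[k], columns[k]))
        x, xx = columns[j], dot(columns[j], columns[j])
        shrink = [[eye[a][b] - nu * x[a] * x[b] / xx for b in range(n)]
                  for a in range(n)]
        prod = matmul(shrink, prod)
        resid = [r - nu * ls[j] * xi for r, xi in zip(resid, x)]
    hat = [[eye[a][b] - prod[a][b] for b in range(n)] for a in range(n)]
    fitted = [yi - ri for yi, ri in zip(y, resid)]
    return hat, fitted

# Toy data: two orthogonal, centered predictors; y = 2*x1 + 0.5*x2 exactly.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [1.0, -2.0, 0.0, 2.0, -1.0]
y = [2.0 * a + 0.5 * b for a, b in zip(x1, x2)]

hat5, fitted5 = boost_hat_matrix([x1, x2], y, nu=0.1, n_steps=5)
df5 = trace(hat5)                          # trace formula df(m) at m = 5
hat_y = [dot(row, y) for row in hat5]      # B_m applied to y
hat1, _ = boost_hat_matrix([x1, x2], y, nu=1.0, n_steps=1)
```

In this toy run the same predictor is selected at all five steps, so the trace equals 1 - 0.9^5, about 0.41. Nothing in the formula accounts for the search over predictors that produced the sequence.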

The left example is the prostate data (Hastie et
al., 2001, Figure 10.12) and has 67 observations and
9 predictors (including intercept). The right example
fits a univariate piecewise-constant spline model
of the form $f(x) = \sum_{j=1}^{50} \beta_j h_j(x)$,
where the $h_j(x) = I(x > c_j)$ are a sequence of Haar
basis functions with predefined knots $c_j$ at the
unique values of the input values $x_i$. There are 50
observations and 50 predictors. In both problems we
fit the limiting L2-boost model iFSLR, using the
lars/forward.stagewise procedure. Figure 2 shows the
coefficient profiles.

Fig. 2. Coefficient profiles for the iFSLR algorithm for the two examples. Both profiles are monotone, and are identical to
the lasso profiles on these examples. In this case the df increment by 1 exactly at every vertical break-point line.

In this case, using the results in Efron et al. (2004),
it can be deduced that the equivalent limiting version
of the hat matrix (5.6) of BH simplifies to a
similar but more compact expression:

$$\mathcal{B}_k = I - (I - \gamma_k \mathcal{H}_k)(I - \gamma_{k-1} \mathcal{H}_{k-1}) \cdots (I - \gamma_1 \mathcal{H}_1). \tag{3}$$

Here k indexes the step number in the lars algorithm,
where the steps delineate the breakpoints in
the piecewise-linear path. $\mathcal{H}_j$ is the hat
matrix corresponding to the variables involved in the
jth portion of the piecewise-linear path, and
$\gamma_j$ is the relative distance in arc-length
traveled along this piece until the next variable
joins the active set (relative to the arc-length of
the step that went all the way to the least squares
fit). Using the BH definition, we would compute
$\mathrm{df}(k) = \mathrm{trace}(\mathcal{B}_k)$
(vertical axis in Figure 1).
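
Expanding (3) for the first two steps shows why the trace stays below the size of the active set. Write $\mathcal{B}_k$ for the matrix in (3), $\mathcal{H}_j$ for the hat matrix of the jth segment and $\gamma_j$ for its relative step length. The following is a sketch under two assumptions not stated explicitly above: each $\mathcal{H}_j$ is an orthogonal projection, and the active sets are nested, so that $\mathcal{H}_2 \mathcal{H}_1 = \mathcal{H}_1$.

```latex
\begin{align*}
\mathcal{B}_1 &= I - (I - \gamma_1 \mathcal{H}_1) = \gamma_1 \mathcal{H}_1, \\
\operatorname{trace}(\mathcal{B}_1) &= \gamma_1 \operatorname{trace}(\mathcal{H}_1), \\[4pt]
\mathcal{B}_2 &= I - (I - \gamma_2 \mathcal{H}_2)(I - \gamma_1 \mathcal{H}_1)
   = \gamma_2 \mathcal{H}_2 + \gamma_1 (1 - \gamma_2) \mathcal{H}_1, \\
\operatorname{trace}(\mathcal{B}_2) &= \gamma_2 \operatorname{trace}(\mathcal{H}_2)
   + \gamma_1 (1 - \gamma_2) \operatorname{trace}(\mathcal{H}_1).
\end{align*}
```

Since $\operatorname{trace}(\mathcal{H}_j)$ counts the linearly independent columns in the jth active set and $0 \le \gamma_j \le 1$, each trace is at most the size of the current active set, with equality only when the last segment runs all the way to its least-squares fit.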

These two examples were chosen carefully, for they 
both satisfy the positive cone condition mentioned 
above. In particular, the iFSLR path is the lasso 
path in both cases, and the active set grows by one 
at each step. More importantly, it is under these 
conditions that Efron et al. (2004) established that 
$\mathrm{df}_T(k) = k + 1$ exactly (horizontal axis in Figure 1).
The +1 takes care of the intercept.

Consider the first step. The dominant variable en- 
ters the model, and gets its coefficient incremented 
until we reach the point that the next competitor 
is about to enter. At this point the df is exactly 2,
while the formula gives
$\mathrm{df}(1) = \mathrm{trace}(\mathcal{B}_1) = 1.48$ for the
first example in Figure 1; this is off by about 25%.

The exact df satisfies our intuition as well. If the 
first variable is far more significant than the rest, 
we will almost fit it entirely ($\gamma_1 \approx 1$) before the next
one enters, and at that point the model has 2 df.
There is virtually no price for searching, because 
searching was not really needed. On the other hand, 
if many variables are competing for the first slot, 
shortly after the chosen one enters, another might 
appear, long before the first is fit completely ($\gamma_1 \ll 1$).
Here the model also has 2 df, despite the fact that
the first variable has hardly progressed at all. This 
is the price paid for selection. 
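
The price of selection can be checked numerically with the covariance definition of degrees of freedom used in Efron et al. (2004), df = sum_i cov(yhat_i, y_i) / sigma^2. The sketch below estimates it by simulation for two toy fits; neither is iFSLR or the lasso. The names monte_carlo_df, fit_mean and fit_best_single are hypothetical, and the setup (a null model, coordinate vectors as predictors, Gaussian noise) is assumed purely for illustration.

```python
# Sketch only: Monte Carlo estimate of df = sum_i cov(yhat_i, y_i) / sigma^2.
# The fits and the simulation setup are illustrative assumptions.
import random

def monte_carlo_df(fit, mu, sigma, n_rep=4000, seed=0):
    """Simulate y = mu + noise n_rep times and estimate the covariance df of `fit`."""
    rng = random.Random(seed)
    n = len(mu)
    ys, fits = [], []
    for _ in range(n_rep):
        y = [m + rng.gauss(0.0, sigma) for m in mu]
        ys.append(y)
        fits.append(fit(y))
    total = 0.0
    for i in range(n):
        y_bar = sum(y[i] for y in ys) / n_rep
        f_bar = sum(f[i] for f in fits) / n_rep
        # Sample covariance between the i-th fitted value and the i-th response.
        total += sum((y[i] - y_bar) * (f[i] - f_bar)
                     for y, f in zip(ys, fits)) / (n_rep - 1)
    return total / sigma ** 2

def fit_mean(y):
    """Fixed linear operator: fit every observation by the overall mean."""
    m = sum(y) / len(y)
    return [m] * len(y)

def fit_best_single(y):
    """Adaptive fit: with coordinate vectors as predictors, the least-squares
    fit on the single best predictor keeps the largest |y_i| and zeroes the rest."""
    j = max(range(len(y)), key=lambda i: abs(y[i]))
    return [y[i] if i == j else 0.0 for i in range(len(y))]

mu = [0.0] * 5                      # null model: no true signal
df_mean = monte_carlo_df(fit_mean, mu, sigma=1.0)
df_select = monte_carlo_df(fit_best_single, mu, sigma=1.0)
```

The mean fit is a fixed linear operator, so its estimate lands near 1. The best-single-predictor fit also uses one coefficient, yet its estimate comes out well above 1, because the choice of predictor depends on y.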



Even when the positive cone conditions are not 
satisfied, it can be shown that the size of the active 
set is an unbiased estimate of the true df 
(Zou, Hastie and Tibshirani (2007)). 

It is possible that the authors can devise a correction
for their $\mathrm{df}(k)$ formula, based on the insights
learned here. In some cases it may be possible to 
calibrate the formula to match the size of the active 
set. Failing that, one can use bootstrap methods to 
estimate df. But if the main purpose for estimating 
df is for model selection, K-fold cross-validation is a 
useful alternative. 
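
As a rough sketch of the cross-validation alternative, the following chooses the number of boosting steps by K-fold cross-validation. It is an illustration under assumptions, not the BH software: fit_path, predict and cv_choose_steps are hypothetical helpers, the base procedure is the componentwise least-squares update with step-length nu, the folds are interleaved rather than randomized, and the data are simulated.

```python
# Sketch only: K-fold cross-validation over the number of boosting steps.
# Helper names, fold assignment and data are illustrative assumptions.
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_path(X, y, nu, n_steps):
    """Componentwise L2-boosting on training rows X (no constant columns).
    Returns the training means and the coefficients after 0, 1, ..., n_steps steps."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    norms = [dot(x, x) for x in cols]
    resid = [yi - y_mean for yi in y]
    beta = [0.0] * p
    path = [list(beta)]
    for _ in range(n_steps):
        ls = [dot(cols[k], resid) / norms[k] for k in range(p)]
        j = max(range(p), key=lambda k: ls[k] ** 2 * norms[k])
        delta = nu * ls[j]
        beta[j] += delta
        resid = [r - delta * xi for r, xi in zip(resid, cols[j])]
        path.append(list(beta))
    return x_mean, y_mean, path

def predict(x_mean, y_mean, beta, row):
    return y_mean + sum(b * (v - m) for b, v, m in zip(beta, row, x_mean))

def cv_choose_steps(X, y, nu, n_steps, n_folds):
    """Return (best number of steps, CV mean squared error for each step count)."""
    n = len(y)
    sq_err = [0.0] * (n_steps + 1)
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        x_mean, y_mean, path = fit_path([X[i] for i in train_idx],
                                        [y[i] for i in train_idx], nu, n_steps)
        for m, beta in enumerate(path):
            for i in test_idx:
                sq_err[m] += (y[i] - predict(x_mean, y_mean, beta, X[i])) ** 2
    cv_err = [e / n for e in sq_err]
    best_m = min(range(n_steps + 1), key=lambda m: cv_err[m])
    return best_m, cv_err

# Simulated data: one informative predictor, one pure-noise predictor.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
y = [2.0 * row[0] + 0.5 * rng.gauss(0, 1) for row in X]
best_m, cv_err = cv_choose_steps(X, y, nu=0.1, n_steps=50, n_folds=5)
```

Selecting the stopping point this way sidesteps the need for a df estimate altogether, at the cost of refitting the path once per fold.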